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Abstract 



We consider linear mixed models in which the observations are grouped. A 
penalization on the fixed effects coefficients of the log-likelihood obtained by consid- 
ering the random effects as missing values is proposed. A multicycle ECM algorithm 
is used to solve the optimization problem; it can be combined with any variable 
selection method developed for linear models. The algorithm allows the number of 
parameters p to be larger than the total number of observations n; it is faster than 



the ImmLasso ( Schelldorfer et al. , 2011 ) since no n x n matrix has to be inverted. We 



show that the theoretical results of Schelldorfer et al. (2011) apply for our method 
when the variances of both the random effects and the residuals are known. The 



combination of the algorithm with a variable selection method (Rohart, 2011) shows 



good results in estimating the set of relevant fixed effects coefficients as well as esti- 
mating the variances; it outperforms the ImmLasso both in the common case {p < n) 
and in the high-dimensional case {p >n). 



1 Introduction 

More and more real data sets are high-dimensional data because of the widely-used new 
technologies such as high-thoughput DNA/RNA chips or RNA seq in biology. The high- 
dimensional setting -in which the number of parameters p is greater than the number of 
observations n- generally implies that the problem can not be solved. In order to address 
this problem, some conditions are usually added such as a sparsity condition -which means 
that a lot of parameters are equal to zero- or a well-conditioning of the variance matrix 
of the observations, among others. A lot of work has been done to address the problem 
of variable selection, mainly in a linear model Y = XP + e, where X is an n x p matrix 
containing the observations and e is a n-vector of i.i.d random variables, usually Gaussian. 
One of the oldest method is the Akaike Information Criterion (AIC), which is a penalization 
of the log-likelihood by a function of the number of parameters included in the model. More 
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recently, the Lasso (Least Absolute Shrinkage and Selection Operator) (Tibshirani, 1996) 
revolutionized the field with both a simple and powerful method: ^^-penalization of the 
least squares estimate which exactly shrinks to zero some coefficients. The Lasso has some 



extensions, a group Lasso (Yuan and Lin, 2007), an adaptive Lasso (Huang et al. , 2008) 



and a more stable version known as BoLasso (Bach 2009), for example. A penalization on 
the likelihood is not the only way to perform variable selection. Indeed statistical testing 
has also been used recently (Rohart, 2011) and it appears to give good results. 

In all methods cited above, the observations are supposed to be independent and iden- 
tically distributed. When a structure information is available, such as family relationships 
or common environmental effects, these methods are no longer adapted. In a linear mixed 
model, the observations are assumed to be clustered, hence the variance- covariance matrix 
V of the observations is no longer diagonal but could be assumed to be block diagonal 
in some cases. A lot of literature about linear mixed models concerns the estimation of 



the variance components, either with a maximum likelihood estimation (ML) (Henderson 



1973 1953) or a restricted maximum likelihood estimation (REML) which accounts for 



the loss in degrees of freedom due to fitting fixed effects (Patterson and Thompson, 1971 



Harville 1977; Henderson 1984; FouUey et al. , 2006). However, both methods assume that 



each fixed effect and each random effect is relevant. This assumption might be wrong and 
leads to false estimation of the parameters, especially in a high- dimensional analysis. Con- 
trary to the linear model, there is little literature about selection of fixed effects coefficients 
in a linear mixed model in a high-dimensional setting. 

Both Bondell et al. (2010) and Ibrahim et al. (2011) used a penalized likelihood to per- 
form selection of both the fixed and the random effects. However, their simulation studies 
were only designed in a low dimensional context. Bondell et al. (2010) introduced a con- 
strained EM algorithm to solve the optimization problem, however the algorithm does not 



really cope with the problem of high dimension. To our knowledge, only Schelldorfer et al. 



(2011) studied the topic in a high dimensional setting. Their paper introduced an algo- 
rithm based on a ^^-penalization of the maximum likelihood estimator in order to select the 
relevant fixed effects coefficients. As highlighted in their paper, their algorithm relies on 
the inversion of the variance matrix of the observations V, which can be time-consuming. 
Finally, their method depends on a regularization parameter that has to be tuned, as for 
the original Lasso. As this question remains an open problem, they proposed the use of 
the Bayesian Information Criterion (BIC) to choose the penalty. 

All methods are usually considered with one grouping factor -meaning one partition of 
the observations-, which can be sometimes misappropriate when the observations are di- 
vided w.r.t two factors or more; for instance when a family relationship and a common 
environmental effect are considered. 

We present in this paper another way to perform selection of the fixed effects in a 
linear mixed model. We propose to consider the random effects as missing data, as done 
Bondell et al. (2010) or in FouUey (1997), and to add a ^^-penalization on the log- 



in 



likelihood of the complete data. Our method allows the use of several different grouping 



factors. We propose a multicycle ECM algorithm (FouUey 1997 McLachlan and Krishnan 



2008 Meng and Rubin , 1993 ) to solve the optimization problem; this algorithm possesses 
convergence properties. In addition, we show that the use of BIC in order to tune the 
regularization parameter as proposed by Schelldorfer et al. (2011) could sometimes turn 
out to be misappropriate. 
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We give theoretical results when the variances of the observations are known. Due to the 
design of the algorithm that is decomposed into steps, the algorithm can be combined with 
any variable selection method built for linear models. Nevertheless, the performance of 
the combination strongly depends on the variable selection method that is used. As there 
is little literature on the selection of the fixed effects in a high-dimensional linear mixed 
model, we will mainly compare our results to those of Schelldorfer et al. (2011). 

This paper extends the analysis on a real data-set coming from a project in which 
hundreds of pigs have been studied. The aim is to enlighten relationships between some 
phenotypes of interest and metabolomic data (Rohart et al. , 2012). Linear mixed mod- 
els are appropriate since the observations are repeated data from different environments 
(groups of animals are reared together in the same conditions). Some individuals are also 
genetically related, in a family effect. The data set consists in 506 individuals from 3 
breeds, 8 environments and 157 families. The metabolomic data contains p = 375 vari- 
ables. We will investigate the Daily Feed Intake (DFI) phenotype. 



This paper is organized as follows: we will first describe the linear mixed model and the 
objective function, then we will present the multicycle ECM algorithm that is used to solve 
the optimization problem of the objective function. Section [3] gives a generalization of the 
algorithm of Section [2] that can be used with any variable selection method developed for 
linear models. Finally, we will present results from a simulation study showing that the 
combination of this new algorithm with a good variable selection method performs well, 
in terms of selection of both the fixed and random effects coefficients (Section |4]), before 
applying the method on a real data set in Section [5j 



2 The method 

Let us introduce some notations that will be used throughout the paper. Var{a) denotes 
the variance- covariance matrix of the vector a. For all a > 0, set to be the identity 
matrix of M."-. For A G M"^^, let Aj^j A,^j and Aj^, denote respectively the submatrix of 
A composed of elements of A whose rows are in / and columns are in J, whose columns 
are in J with all rows, and whose rows are in / with all columns. Moreover, we set for all 
a > 0, 6 > 0, Oa to be the vector of size a with all its coordinates equal to and Oaxt to be 
the null matrix of size a x b. Let us denote \A\ the determinant of matrix A. 

2.1 The linear mixed model setup 

We consider the linear mixed model in which the observations are grouped and we suppose 
that only a small subset of the fixed effects coefficients are non-zero. The aim of this paper 
is to recover this subset through an algorithm that will be presented in the next section. 
In the present section we explicit the linear mixed model and our objective function. 

Mixed models are often considered with a single grouping factor, meaning that each 
observation belongs to one single group. In this paper we allow several grouping factors. 
Assume there are q random effects and q grouping factors (g > 1), where some grouping 
factors may be identical. The levels of the factor k are denoted {1, 2, . . . , A^^}. The i*^- 
observation belongs to the groups {ii, . . . ,iq), where for all I = 1, ... ,q, ii E {1,2, ... , Ni}. 
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We precise that two observations can belong to the same group of one grouping factor 
whereas they can belong to different groups of another grouping factor. 

Let n be the total number of observations with n = ni^k,^k < q, where rii^k is the 



number of observations within group i from the grouping factor k. Denote N = ^2 
The linear mixed model can be written as 



X/3 + ^ ZkUk + e, 



(1) 



k=l 



where 



y is the set of observed data of length n, 

(3 is an unknown vector of W; (3 = . . . , f3p), 

X is the n X p matrix of fixed effects; X = {Xi, 



For k = 
factor k, 

For k = 
factor k. 



, g, Mfc is a A^fc- vector of the random effect corresponding to the grouping 
. ,q, Zk is a. n X Nk incidence matrix corresponding to the grouping 



e = (ei, . . . , e„)' is a Gaussian vector with i.i.d. components e ~ A/'n(0, cTg/„), where 
(Je is an unknown positive quantity. We denote by R the variance-covariance matrix 
of e, R 



To fix ideas, let us give a example of matrices Zk for n = 6 and two random effects. 
/I 0\ /xi 0\ 



Let Zi 



1 

1 

1 

1 

VO I) 



and 



X2 


















X5 



The grouping factors 1 and 2 are the same 



for the two random effects Ui and U2, and Z2 is the incidence matrix of the interaction of 
the variable x = (xi, . . . , xg) and the grouping factor. 

Throughout the paper, we assume that Uk ~ AfNk{0,o'k^Nk)y where cxfc is an unknown 

,u'f^y, Z the concatenation of {Zi,...,Zq), 
and r the block diagonal matrix of 



positive quantity. We denote u = {u[, 
G the block diagonal matrix of afl^i,- 
iJni, ■ ■ -^IqlN,, where 7^ = a^/al 
Remark that with these notations. Model ([T| can also be written as: y = X(3 + Zu + e. 
In the following, we assume that e, mi, . . . , Uq are mutually independent. Thus Var{ui, . . . , 

i?^ ■ consider the matrices X and {Z^}^ ^ to be fixed design. 



Note that our model ([T]) and the one in |Schelldorfer"iraL] ( |20ll| ) are almost identical 
when all the grouping factors are identical, except that we supposed ui . . . , Ug to he inde- 
pendent while they did not make this assumption. Nevertheless, for their simulation study, 
they considered i.i.d. random effects. 
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Let us denote by J the set of the indices of the relevant fixed effects of Model ([T]); 
J = {j,f3j 7^ 0}. The aim of this paper is to estimate J, (3, G and R. In the whole 
paper, the number of fixed effects p can be larger than the total number of observations n. 
However, we focus on the case where only a few fixed-effects are relevant. We also assume 
that only a few grouping factors are included in the model since this paper was motivated 
by such a case on a real data set, see Section |5] Hence we assume + | J| < n. 



2.2 A £^ penalization of the complete log-likelihood 

In the following, we consider the fixed effects coefficients /3 and the variances af, . . . ,ag,al 
as parameters and {uk}j^^^i as missing data. We denote $ = {(3,af, . . . , cXg, a^). 
The log-likelihood of the complete data x = {y, u) is 



where 



L(<l>; x) = Lo(/3, al, a^, . . . , aj; e) + ^ Lk{al; 



k=l 



(2) 



-2Lo(/3,ag,ai, 



n log(27r) + n log(a 



X/3 - ^ ZkUk 



k=l 



/al (3a) 



VA; G {1, . . . , g} , -2Lk{al; Uk) = log(27r) + A^^ log(a^) + | \uk\f /al 



(3b) 



Indeed, ^ comes from p{x\^) 
Lo(/3, a^, cTi, . . . , a^; e) = Lo{al, e) = n\og{27i)+n\og{al)+e'e/a'^^ because eja^ 
and (3b) from Uk\crl ~ J\fN^^{0, all^^). 



(3a) comes from 

^fnio 



a 



2/ 

r) 



Since we allow the number of fixed-effects p to be larger than the total number of 
observations n, the usual maximum likelihood (ML) or restricted maximum likelihood 
(REML) approaches do not apply. As we assumed that f3 is sparse -many coefficients are 
assumed to be null- and since we want to recover that sparsity, we add a £^ penalty on 
P to the log-hkelihood of the complete data Indeed a penahzation is known to 



induce sparsity in the solution, as in the Lasso method (Tibshirani, 1996) or the ImmLasso 



method ( Schelldorfer et al. , 2011). Thus we consider the following objective function to be 
minimized: 

(7($;x) = -2L(<l>;x) + A|/3|i, (4) 

where A is a positive regularization parameter. Remark that the function g could have 
been obtained from a Bayesian setting considering a Laplace prior on f3. 
It is interesting to note that finding a minimum of the objective function (|4]) is a non-linear, 
non-differentiable and non convex problem. But more importantly, one thing that strikes 



out -especially from (3b )- is that the function g is not lower-bounded. Indeed, L($; x) tends 
to infinity when both Uk and ak tends toward 0. It is a well-known problem of degeneracy 



of the likelihood, especially studied in Gaussian mixture model (Biernacki and Chretien 



2003 ) but not much concerning mixed models. In linear mixed models, some authors focus 



on the log-likelihood of the marginal model in which the random effects are integrated out 



in the matrix of variance of the observations Y, such as in Schelldorfer et al. (2011): 



y = Xl3 + e, where e ~ A/'(0, V). 
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Note that V = ZGZ' + R. The degeneracy of the hkehhood can also appear in the marginal 
model when the determinant of V tends toward zero. This phenomenon is likely to happen 
in a high dimensional context when too much fixed-effects enter the model, that is to say 



when the amount of regularization chosen by the penalty of the ImmLasso (Schelldorfer 
et al. , 2011) or by A in (|4]) is not large enough. 



Because of the non lower-boundness of the likelihood, the problem of minimizing the 
function g is ill-posed: we are not interested in the minimization of g on the parameter 
space {/3 e RP, a? > 0, . . . , > 0, > O} but more interested in minimizing g inside the 



parameter space 



A = {/3g 



> 0, 



>0}. 



Instead of adding a i penalty on the random effect as Bondell et al. (2010), we will 



use the degeneracy of the likelihood at the frontier of the parameter space A to perform 
selection of the random effects. Indeed, if it exists 1 < k < q such that the minimization 
process of the function g, defined by Q, takes place at the frontier = of the param- 
eter space A, then the grouping factor k is deleted from the model ([T]). Nevertheless, our 
method is more restrictive than the one of Bondell et al. (2010) since we assume A^+| J| < n. 



The minimization process of the function g can coincide with the deletion of the random 
effect k, for 1 < k < q, for two reasons: either the true underlying model was different from 
the fitted one -some grouping factors are included in the model although there is no need 
to-, or because the initialization of the minimization process was to close to an attraction 
domain of (ufc,(J^) = (OArj.,0) (Biernacki and Chretien, 2003). 



When selection of the random effects is performed in the linear mixed model ([T| with q 
random effects, a new model is fitted with q — 1 grouping factor and the objective function 
is modified accordingly. The selection of the random effects can be performed until no 
grouping factor remains, then a linear model is considered. 

In the next section we will use a multicycle ECM algorithm in order to solve the 
minimization of (|4]); it performs selection of both the fixed and the random effects. 



2.3 A multicycle ECM algorithm 



The multicycle ECM algorithm (Meng and Rubin 1993 FouUey 1997 McLachlan and 
Krishnan, 2008) used to solve the minimization problem of (|4|) contains four steps -two E 
steps interlaced with two M steps-; each will be described in this section. 
Recall that $ = {(3,af, . . . ,cr^,o"g) is the vector of the parameters to estimate and that 
u = {u[, . . . ,u'i^y is a vector of missing values. For the sake of simplicity, we denote 
}C = {l,...,q} and = Wl},^^,^. 

The multicyle ECM algorithm is an iterative algorithm. We will index the iterations by 
t eN. 0'*] will denote the current estimation of the parameter at iteration t. 
Let K,i„ 

,$=$[<] denote the conditional expectation under the distribution of u given the 
vector of observations y and the current estimation of the set of parameters $ at iteration 
t. 



2.3.1 First E-step 

Let denote 
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We can decompose Q as follows: 

k=l 

where 

go($; = n log(27r) + n log(a,2W) + (e'e)^''*' + \W% 

and 

VA; G /C,Qfc(a2; $W) = iVlog(27r) + iVlog(a^t*') + K'^fc)/^?*'- 

By definition, we have for all 1 < i < ra, \/ar„|y (e^) = (e^)-] jE^i^^ (ei)||' 

Hence 

We can then explicit 

(e'e) = \\y- X/jW - ZE {u\y, $ = <I>M) \f + tr [ZVar {u\y, $W) Z') . (5) 



According to the denomination of Henderson (1973), E {^u\y, $ = $M) is the BLUP (Best 



Linear Unbiased Prediction) of u for the vector of parameters $ equal to Let us denote 



2.3.2 M-Step for /3 

The next step performs a minimization of Qoi(3, cr^, crl; $M) with respect to (3: 

p[t+i] ^ Argmin \\{y - Zu^'+^/^^) - X (5\\ + \\(5\\ . (6) 

Remark that (|6]) is a Lasso on (3 with the vector of "observed" data [y — Zu^'^^^^'^^^ and 
the penalty Acxe'*^. 



2.3.3 Second E-Step 

A second E-step is performed with the actualization of the vector of missing values u: 
= E(u\y,^ = a? = ^?\ • • • , = 4\cT'e = ^e'*^) , thus 

= (Z'Z + rW)-iZ' [y - X/3[*+i]) . 

We define VA; G /C, m^^^' to be the element of size that corresponds to the grouping 
factor k in m'*^"'^^. 



7 



2.3.4 M-step for {aj, ...,(7^ al) 

The actualization of the variances {o"fc}i<fe<g and a1 are performed with the minimization 
of {Qk} i<k<q Qo respectively. 

Let G /C, the minimization of Qk with respect to al gives: 
Besides, 



Moreover we have, thanks to Henderson ( 1973 ) 
where T^^k is defined as follows: 



, \\\tr (var [uk\y, af\af\p^'^'^ 



(z'z + rW)"' = 



Z'2Zi Z2Z2 + '^^2^N2 



Z[Zq 



-1 



V z'^z, 

^Ti,! ri,2 ... TiA 

T'l rp rp 

1,2 -'2,2 • • • J-'. 



Z',Z2 



Z'^Z.^^flJ 



2,9 



Thus, for all /c e /C: 



0" 



2[t+l] 



u 



The minimization of Qo with respect to a\ gives: cxe'*^^' = ^=$[t] (e'e)/?7,. From ([s]), we 
have 



2[t+l] 



2[t] 



Since 



tr [Z {Z' Z ' Z'^ = tr((Z'Z + rW) 'z'Z 

= AT-tr [(z'z + rw)~'rw 



A:=l 



we have 



2[i+l] 



1 



n 



In summary, the algorithm is the following: 



fc=i 
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Algorithm 2.1 (Lasso+). Initialization: 

Set /C = {1, . . . , g}. Initialize the set of parameters $M = (cr^^°', cTe^^', 

Define as t/ie diagonal matrix of^f'^lNi, ■ ■ ■ ,lf^lNq, where 7!°' = cre'^V'^fc^' 



Zq and u 



Define Z as the concatenation of Zi, 
Until convergence: 
1. E-step 

„[t+i/2] ^ ^z'z + rW)-iz'(i/ - X/3W) 

^[i+i] = ^rc/mm f 1 1 (y - - I V \af^ \(3\-^ 

3. E-step 

= {Z'Z + T^'^)-^Z'{y - 



l! • • • ! "5 



('aj For A; in K,, set crf'*^"'^' 



1 

n 



/Nk + tr {Tk,k)af^/Nk 



^ /iVfe < lO-Ve^*') then /C = /C\ {A;} 



(^c^ For k in K,, if 



u 



a. 



Define Z as the concatenation of {Zk}f^^j^ and u as the transpose of the concatenation of 



Set rl*+^l as the block diagonal matrix of \ 'it^^^^Ni. \ , where for all k G IC, 

I J k&JC 

end 



The convergence of Algorithm 2.1 is ensured since it is a multicycle ECM algorithm 



(Meng and Rubin 1993) 



Three stopping criteria are used to stop the convergence process of the algorithm: a con 



dition on 



a condition on 



\u 



u 



for each random effect Uk and 



a condition on | |L($[*+^], x) — where L($,x) is the log-likelihood defined by 

([2]). The convergence takes place when all the criteria are fulfilled. We also add a fourth 
con dition that controls the number of iterations. We choose to initialize the algorithm 



2.1 



as follows: for all 1 < A; < g, a^^ 



-1] 2[( 



0.6 a. 



2[-l] 



and fo", 



2[-l] 



IS 



estimated from a linear estimation (without the random effects) of the Lasso at the given 
penalty A. We will study in Section |^4] the influence of the initialization of the algorithm 
on simulated data. 

Note that Step 4(c) performs the selection on the random effects; we decide to delete a 
random effect when its variance became lower that 10~^a, 



2[t] 



The estimation of the set of parameters $ is biased (Zhang and Hunag 2008). One last 



step can be added in order to address this problem once both Algorithm 2A_ has converged 
and the penalization parameter A has be tuned. Indeed, one should prefer to use Algorithm 



2.1| in order to estimate both the support of (3 and the support of the random effects, and 
then to estimate the set $ with a classical mixed model estimation on the model: 



y 



XP j + ^ ZkUk + e. 



kas 
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where J and S are the estimated set of indices of the relevant fixed effects and the estimated 
set of indices of the relevant random effects respectively. 

Proposition 2.2. When the variances are known, the minimization of our objective func- 
tion ^ is the same as the minimization of Q{(3) = {y — X(3yV~^{y — X(3) + X\(3\i, which 



is the objective function of Schelldorfer et al. (2011) at known variances 



Let us recall that Schelldorfer et al. (2011) obtained theoretical results on the consis- 



tency of their method. According to Proposition |2.2 , these results apply to our method in 
the case of known variances. The proof of Proposition 2.2 is given in Appendix C. 



Note that when individuals are genetically related through a known relationship matrix 
A, we have u ~ A4(0, CT^A), with as > 0. Thanks to |Hendersoii| ( |1973D , A'^ can be directly 
computed. In all that precede, the changes are the following : the matrix F becomes the 
matrix a'^/a'^A'^ and becomes u'A~ 



u. 



2.4 The tuning parameter 



Algorithms |2.1| involves a regularization parameter A; the solution depends on this pa- 
rameter. This amount of shrinkage has to be tuned. We choose the use of the Bayesian 



Information Criterion (BIG) (Schwarz, 1978): 



Ar grain \ log |Va| + (y — X(5\)'V) 

A 



-1 

A 



{y-X(3x)+d^Aog{n] 



where Vx = ^ki^ic^l^^^'k + ^l^n and (j1^a1^(3\ are obtained from the minimization of 
the objective function g defined by Q. Moreover, d\ := Yl^k=i lo-fcT^o + |<^a| is the sum of 
the number of non-zero variance-covariance parameters and the number of non-zero fixed 
effects coefficients included in the model which has been selected with the regularization 
parameter A. 

Other methods can be used to choose A such as AIC or cross-validation, among others. 
An advantage of BIG over cross-validation is mainly the gain of computational time. 



In the next section, we propose a generalization of Algorithm 2A which allows the use 
of any variable selection methods developed for linear models. 



3 A generalized algorithm 



Algorithm 2A_ gives good results, as it can be seen in the simulation study of Section |4j 
Nevertheless, since Step 2 of Algorithm 2A_ aims at selecting the relevant coefficients of /3 in 
a linear model, the Lasso method can be replaced with any variable selection method built 
for linear models. If the chosen variable selection method optimizes a criterion, such as the 



adaptive Lasso (Zou, 2006) or the elastic net (Zou and Hastie, 2005), the algorithm thus 



obtained remains a multicycle ECM algorithm and the convergence property still applies. 
However, the convergence property does not hold for methods that do not optimize a 
criterion. 



Algorithm 2.1 can be reshaped for a generalized algorithm as follows: 
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(o-f ,ae^M,/3M). 5e^/C = {l,...,g}. 

nq In^, where = 



Algorithm 3.1. Initialization: 
Initialize the set of parameters 

Define as the block diagonal matrix of '^^^I^^, . . . ,'-fq'lN^, where Yk' = 
Define Z as the concatenation of Zi, . . . , Zg and u = {u[, . . . , u'^)' . 
Until convergence: 

1. yi^+m = (^z'z + rW)-'^'(i/ - 

2. Variable selection and estimation of (3 in the linear model y — Z-u'*^^/^! 
where e'*! 

3. = 



2[t] 



X/3 + eW, 



Ar(0, a, 

[z'z + rW)-iz'(y-x/3[*+i^ 



^. (a) For k in )C, set cr^'*^"*^^ 



2(i+l) 
' e 



(b) Set a. 

(c) For k in K,, if 



1 

n L 



u 



y - 

2 



/Nk < 10' 



then /C = /C\ {A;} 



a. 



Define Z as the concatenation of {Zk}f,^j^ and u as the transpose of the concatenation of 



Set rl*+^l as the block diagonal matrix of 



7^ 

end 



CTe 



2[t+l] 



kaK. 



, where for all k E IC, 



We choose to initialize Algorithm 3.1 as follows: for all 1 < /c < g, a 



'-^ of 

q 



1] 2H 
5 <->e 



0.6 (Te' ^\ and (ue' ^\ P^^^) is estimated from a linear estimation (without the random ef- 
fects) of the method used at Step 2. 



In the following we propose to combine Algorithm 2.1 with a method that does not need 
a tuning parameter, namely the procbol method (Rohart 2011). The procbol method is 
a sequential multiple hypotheses testing which statistically determines the set of relevant 
variables in a linear model y = X/3 + e where e is an i.i.d Gaussian noise. This method 
is a two-step procedure: the first step orders the variables taking into account the obser- 
vations y and the second step uses multiple hypotheses testing to separate the relevant 
variables from the irrelevant ones. The procbol method is proved to be powerful under 
some conditions on the signal in Rohart (2011). 

In Section [4], we show that the combination of Algorithm 3A_ and the procbol method 
performs well on simulated data. 



4 Simulation study 

The purpose of this section is to compare different methods that aim at selecting both the 
correct fixed effects coefficients and the relevant random effects in a linear mixed model 
([T]), but also to look at the improvement obtained from including random effects in the 
model. 



4.1 Presentation of the methods 



We compare several methods, some of them are designed to work in a linear model: 
Lasso (Tibshirani, 1996), adLasso (Zou, 2006) and procbol (Rohart, 2011), while oth- 



ers are designed to work in a linear mixed model: ImmLasso ( jSchelldorfer et aL 2011), 
Algorithm 2.1 (labelled as Lasso+), adLasso+Algorithm 3.1 (labelled as adLasso+) and 
procbol+ Algorithm 3.1 (labelled as pbol+). 

The initial weights of the adLasso and adLasso+ are set to be equal to l/|/3i| where for 
all i G {l,...,p}, Pi is the Ordinary Least Squares (OLS) estimate of (5^ in the model 

y = XiPi + ei. 

The second step of the procbol method performs multiple hypotheses testing with an 
estimation of unknown quantiles related to the matrix X. The calculation of these quantiles 
at each iteration of the convergence process would make the combination of the procbol 
method and Algorithm almost impossible to run; however, since the data matrix X stays 
the same throughout the algorithm, the quantiles also do. Thus the procbol method was 
adapted to be run several times on the same data set by keeping the calculated quantiles, 
which led to a enormous gain of computational time. Some parameters of the procbol 
method were changed in order to limit the time of one iteration of the convergence process, 
as follows. The parameter m which stands for the number of bootstrapped samples used to 
sort the variables (first step of the procbol method) was set to 10. The number of variables 
ordered at the first step of the procbol method was set to 40. Note that when the procbol 
method was used in a linear model, we set m = 100 as advised in Rohart ( |2011 ). Both the 
procbol method and the pbol+ method were set with a user-level of a G {0.1, 0.05}, which 
stands for the level of the testing procedure. 

Concerning all methods that needed a tuning parameter, we set it using the Bayesian 
Information Criterion described in Section 2.4 A particular attention has to be drawn on 
the tuning of the regularization parameter of some methods that could be tricky in some 
cases due to the degeneracy of the likelihood, especially Lasso and adLasso, see Appendix 
B. 



4.2 Design of our simulation study 

Concerning the design of our simulations, we set Xi to be the vector of whose coor- 
dinates are all equal to 1 and we considered four models. For each model, the response 
variable y is computed via y = X]^=i ^i^Pij + ZlLi ^kUk + e, where J = {ii, . . . , ^5} C 
{1, . . . ,p}, with two random effects (g = 2) being standard Gaussian (af = (t| = 1) and e 
being a vector of independent standard Gaussian variables. The models used to fit the data 
differ in the number of parameters p, the number of random effects q and the dependence 
structure of the Xj's. For each model, we have that for all j = 2, . . . ,p: Yl^=i ~ ^'^'^ 
n Sr=i -^ji ~ ^- -^^^ k = 1, . . . ,q, the random effects regression matrix corresponds to 
the design matrix of the interaction between the k^^ column of X and the grouping factor 
k, which gives a n x Nk matrix. The design of the matrices Z^s means that the first q 
grouping variables generates both a fixed effect (corresponds to /3fc's) and a random effect 



(corresponds to Mfc's). As advised in Schelldorfer et al. (2011), the variables that generate 



both a fixed and a random effect do not undergo feature selection; otherwise the fixed 
effect coefficients of those variables tends to be shrunken towards 0. The set of variables 
that do not undergo feature selection can change at each step of the convergence process of 
our algorithms. Indeed, as soon as a variable does not generate a random effect anymore, 
the fixed effect corresponding to that variable undergoes feature selection again. 
The models are defined as follows: 
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• Ml. n= 120, p = 80, /3j = 2/3. For all j = 2, . . . ,p,Xj ~ A/;(0, /„).Tlie division 
of the observations for the two random effects are the same; for all k < 2 : N^. = 
20, Vi G {1, .., 20} rij = 6. This model is fitted assuming g = 3. 

• M2: n = 120, p = 300, Pj = 3/4. The covariates are generated from a multivari- 
ate normal distribution with mean zero and covariance matrix S with the pairwise 
correlation Sfc^/ = pl'^"'^ I and p = 0.5. The division of the observations for the two 
random effects are the same; for all k < 2 : Nk = 20, Vz G {1, .., 20} rii^k = 6. 

• M3: n = 120, p = 300, Pj = 2/3. For all j = 2,...,p,Xj ~ A/;(0,/„). The 
division of the observations for the two random effects are different: A^^i = 20, V« G 
{1,..,20} = 6 and A^2 = 15,Vi G {1,..,15} ^^,3 = 8 

• M4: n = 120, p = 600, /3j = 2/3. For all j = 2, . . . ,p, Xj ^ Afn{0, /„). The division 
of the observations for the two random effects are the same; for all k < 2 : Nk = 
20,Vz G {1,..,20} n,,fc = 6. 

For models Mi, M3, M4, we set J = {1, . . . , 5}. For model M2, we set J = {1, 2, ^3, ^4, i^} 
where {13,1^,15} C {3, . . . ,p}. 

In each model, the aim is to recover both the set of relevant fixed effects coefficients J 
and the set of relevant random effects; but also to estimate the variance of both the random 
effects and the residuals. To judge the quality of the methods, we use several criterion: the 
percentage of true model recovered under the label 'Truth' (both J and the set of relevant 
random effects), the percentage of times the true set of fixed effects is recovered ' J = J', 
the cardinal of the estimated set of fixed effects coefficients | J|, the number of true positive 
TP, the estimated variance of the residuals, the estimated variances a^, . . . , of the 
random effects and the mean squared error mse calculated as an £^ error rate between the 
reality and the estimation -Xf3-. We also calculated the Signal-to-Noise Ratio (SNR) 
as 1 1-^/3 1 11/ 1 1 Ylk=i ^kUk + e||2 ^ach of the replications. 



4.3 Comments on the results 

The detailed results of the simulation study are available in Appendix A. A summary of 
the main results is shown in Figure [l] (a = 0.1 for the procbol method and the pbol+ 
method). No results are given for the ImmLasso of Schelldorfer et al. (2011) in Model M3 
since two different grouping factors are considered and the R-package ImmLasso does not 
include that setting. 



In all models, there is an improvement of the results when we switch from a simple 
linear model to a linear mixed model; indeed there is a significant difference between Lasso 
and Lasso+ or procbol and pbol+, especially with model M4. 

On all models, ImmLasso and Lasso + give very similar results; this is not surprising 
since both are a ^^-penalization of the log likelihood, except for model Mi where ImmLasso 
seems to give better results. This difference comes from the coding of the R-package 
that contains the ImmLasso method. Indeed, a variable that generates both a fixed and 
a random effect does not undergo feature selection in the ImmLasso method when the 
random effect tends towards zero, whereas the Lasso+ method would allow it. 
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Is) |b) 




1 2 3 4 1 2 3 4 



|c) 




12 3 4 



Figure 1: Summary of the results of the simulation study for models Mi — M4 (X axis). 
Results of 'Truth' (a), ' J = J' (b) and Mean Squared Error (c) for each model. 



We observed on our simulation study that both ImmLasso and Lasso + are very sensitive 
to the choice of the regularization parameter. On most simulations of model M4 in which 
p = 600, we observed an edge effect between a regularization parameter that selects few 
fixed effects (fewer than 15) and a regularization parameter that selects too much fixed- 
effects (|J| > n) and thus stops the algorithm because we assumed that the number of 
relevant fixed-effects is lower than min(n — l,p), see Figure 4.3 Nevertheless, the weights 
included in the adLasso+ seems to smooth this phenomenon, see Figure 4.3 for the same 
simulation as Figure 4^ Remark that for the run of model M4 which is on Figure [2| 



Lasso+ could select the true model for a regularization parameter around 0.22 whereas 
adLasso+ could not as a noisy variable enters the set of selected variables before all the 
relevant fixed-effects do. 

Concerning the adLasso+ method, it appears to improve the Lasso + method, except 
for model M4 where the true model is only selected once over the 100 replications. On 
this particular model M4, adLasso+ selects more fixed effects but less relevant ones than 
Lasso+. This could mean that the initial weights are not adapted to this case. Despite 
the result of 'Truth', the mse is lower for adLasso+ than for Lasso+. 

Algorithm 3.1 combined with the procbol method {pbol+) gives the best results over 
all tested methods for all models. Indeed the percentage of true model recovered is the 
largest over all methods, the estimation of the fixed effects is really close to the reality 
and the mse is the lowest among the tested methods. Nevertheless, due to the bias of the 
Lasso, the results in term of mse for Lasso+ and ImmLasso could easily be improved with 
a linear mixed model estimation as said in Section 2.3 (see Appendix). Yet, the results 
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Lasso+ 



adLasso+ 




"T" 



"T" 




0.20 0.22 0.24 

Regularization parameter 



Reguiarization parameter 



Figure 2: Number of selected fixed effects coefficients depending on the value of the regular- 
ization parameter for one run of model M4, for the method (a) Lasso+ and (b) adLasso+. 
The grid of the penalty is as thin as 10"'' next to the area \ J\ > n in (a) and 10~^ in (b). 

of pbol+ are mitigated for model Mi. Indeed, the percentage of true model recovered is 
lower than in the other models because of the selection of the random effects that lacks 
efficiency (the results concerning the selection of the fixed-effects are equivalent as in the 
other models, as shown in Figure [T|. Nonetheless, the results are still better than for the 
others methods. Moreover, a relevant random effect was never falsely deleted in all models 
and for all methods. It is interesting to note that the pbol+ method always converged on 
our simulations. 



A R-package "MMS" is available on CRAN (http://cran.r-project.org). This package 



contains tools to perform fixed effects selection in linear mixed models; it contains the 

previous methods denoted as Lasso+, adLasso+, pbol+, among others. 

All the results presented in this section were obtained with a specific initialization of the 

algorithms. The next paragraph is dedicated to the analysis of the influence of that specific 

initialization. 



4.4 Influence of the initialization of our algorithms 



1' • 



Both Algorithm 2.1 and Algorithm 3.1 start with an initialization of the parameter $ 



/3). We choose to initialize each algorithm with the following setting: for 
all 1 < A; < g, a^'°^ = ^ cxe^ crf'°^ = 0.6 al^ ^\ and {al^ ^\ is estimated from a linear 
estimation (without the random effects) of the method used at Step 2. 



In the current Section, we choose different initializations of Algorithm 2T and Algorithm 
3.1, both on Model M4 (see Section |4]). The initial values of the variances were set from 
0.1 to 10 and of the fixed effects coefficients from —100 to 100. Each algorithm always 
converged towards the same point, whatever the initialization of $, not shown. However, 
the farther is set from the true estimation of $, the higher is the number of iterations 
of the algorithms. 
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1^1 




^1 




Lasso 


14 




-2 






adLasso 


21 


3.4 X 10- 


-2 


_ 


_ 


procbol 


11 


4.1 X 10- 


-2 


- 


- 


Lasso+ 


11 


3.2 X 10- 


-2 


3.2 X 10-3 


6.4 X 10-3 


adLasso+ 


10 


3.3 X 10- 


-2 


2.5 X 10-3 


6.5 X 10-3 


pbol+ 


5 


3.4 X 10- 


-2 


5.9 X 10-3 


6.5 X 10-3 



Table 1: Results for the real data set 



Methods 


CPU Time 


Lasso+ 


0.80 


ImmLasso 


24.28 



Table 2: CPU Time on a single run that selects the same model 



5 Application on a real data-set 



In this section we analyze a real data set which comes from Rohart et al. (2012). The aim 
of this analysis is to pinpoint metabolomic data that describes a phenotype taking into 
account all the available information such as the breed, the batch effect and the relationship 
between individuals. Here we will study the Daily Feed Intake phenotype (DPI). We model 
the data as follows: 

y = XbPb + XmPm + ZeUe + ZfUf + e, (7) 

where y is the DPI phenotype, Xb, Xm, Ze, Zf are the design matrices of the breed effect, 
the metabolomic data, the batch effect and the family effect, respectively. We consider two 
random effects: the batch and the family, considering that each level of these factors is a 
random sample drawn from a much larger population of batches and families, contrary to 
the breed factor. Note that the coefficients /3b do not undergo feature selection. 
We compare several methods on this model: Lasso, adLasso, procbol. Lasso +, adLasso + 
and pbol+ (see Section |4]). The model which is considered for the first three methods is 
y = XbPb + XmPm + Both methods procbol and pbol+ were set with a user-level of 
a = 0.1. The results are presented in Table [TJ 

We observe that considering random effects leads to a decrease of both the residual vari- 
ance and the number of selected metabolomic variables. This behavior is in accordance 
with the simulation study. The question that arises from this analysis is to know whether 
the variables which are selected in the linear mixed models are more relevant than in the 
linear model. Biological analyses remain to be done to answer that question. 



Table |2] gives the computational time of one run when we only consider the batch effect 
-in order to be able to compute the ImmLasso-, showing that the Lasso+ method is much 
faster than the ImmLasso method for a large number of observations (due to the inversion 
of the matrix of variance V at each step of the convergence process). The simulation was 
performed at a regularization parameter that selects the same model for the two methods, 
on a 2.80GHz CPU with 8.OOG0 of RAM. 
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6 Conclusion 



In this paper, we proposed to add a ^^-penalization of the complete log-hkelihood in order 
to perform selection of the fixed effects in a linear mixed model. The multicycle ECM 
algorithm used to minimize the objective function also performs random effects selection. 
This algorithm gives the same results as the ImmLasso of Schelldorfer et al. (2011) when 
the random effects are assumed to be independent, but faster. Theoretical results are 
identical to those of Schelldorfer et al. (2011 ) when the variances are known. The structure 
of our algorithm gives the possibility to combine it with any variable selection method 
built for linear models, but at the price of possibly loosing the convergence property. 
Nonetheless, the combined procbol method appears to give good results on simulated data 
and outperforms other approaches. 

We apphed all these methods to a real data set showing that the residual variance can be 
reduced, even with a small set of selected variables. 
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Appendix A - Results of the simulation study 



Table 3: Results of model Mi. The percentage of true model recovered was recorded -'Truth'- 
as well as J = J. \J\ is the number of fixed effects selected and TP the number of relevant fixed 
effects selected. The signal to noise ratio is equal to SNR = 0.78(0.13). Standard errors are 
given in parentheses, for 100 runs. 





Truth 


J = J 


|J| 


TP 






al 




Ideal 


1 


5 


5 


5 


1 


1 


1 





Lasso 


- 


0.15 


4.95 


4.13 


3.27 


- 


- 


- 








(1.90) 


(1.12) 


(0.62) 


- 


- 


- 


adLasso 


- 


0.16 


5.25 


4.26 


2.91 


- 


- 


- 








(1.84) 


(0.89) 


(0.59) 


- 


- 


- 


procbol 


- 


0.59 


4.70 


4.58 


2.83 


- 


- 


- 


a = 0.1 






(0.78) 


(0.61) 


(0.57) 


- 


- 


- 


procbol 


- 


0.45 


4.47 


4.40 


2.89 


- 


- 


- 


a = 0.05 






(0.67) 


(0.62) 


(0.58) 








TjasRO-l- 

-1— Jc^oovy 1 


0.21 


0.34 


6.42 


5.00 


1.04 


0.88 


0.98 


0.02 








fl 64) 


(0.00) 


fO 21) 


(0 37) 


(0 44) 


(0.06) 


adLasso+ 


0.21 


n 3^ 


fi 34 


4 qq 


n 94 




n 9^1 










(1.41) 


(0.10) 


(0.18) 


(0.36) 


(0.41) 


(0.06) 


ImmLasso 


0.29 


3Q 




^ no 






n Qfi 










(1.29) 


(0.00) 


(0.19) 


(0.38) 


(0.42) 


(0.06) 


pbol+ 


0.55 


0.89 


5.18 


5.00 


0.92 


0.87 


0.97 


0.03 


a = 0.1 






(0.50) 


(0.00) 


(0.18) 


(0.37) 


(0.41) 


(0.06) 


pbol+ 


0.59 


0.93 


5.08 


5.00 


0.93 


0.88 


0.97 


0.03 


a = 0.05 






(0.30) 


(0.00) 


(0.17) 


(0.37) 


(0.41) 


(0.06) 






Pi 






k 


h 


MSE 




Ideal 


0.67 


0.67 


0.67 


0.67 


0.67 


0.00 




Lasso 


0.67 


0.29 


0.31 


0.41 


0.17 


0.79 








(0.27) 


(0.26) 


(0.20) 


(0.19) 


(0.16) 


(0.42) 




adLasso 


0.69 


0.42 


0.46 


0.58 


0.27 


0.60 








(0.27) 


(0.33) 


(0.25) 


(0.23) 


(0.22) 


(0.37) 




procbol 


0.69 


0.63 


0.68 


0.65 


0.49 


0.44 




a = 


= 0.1 


(0.27) 


(0.32) 


(0.17) 


(0.30) 


(0.33) 


(0.31) 




procbol 


0.69 


0.63 


0.68 


0.62 


0.43 


0.51 




a = 


= 0.05 


(0.27) 


(0.32) 


(0.17) 


(0.33) 


(0.36) 


(0.30) 




Lasso+ 


0.69 


0.65 


0.49 


0.41 


0.43 


0.35 








(0.25) 


(0.28) 


(0.17) 


(0.11) 


(0.11) 


(0.17) 




adLasso+ 


0.69 


0.64 


0.59 


0.57 


0.48 


0.26 








(0.25) 


(0.27) 


(0.15) 


(0.12) 


(0.14) 


(0.15) 




ImmLasso 


0.69 


0.65 


0.66 


0.41 


0.43 


0.30 








(0.25) 


(0.28) 


(0.11) 


(0.11) 


(0.10) 


(0.15) 




pbol+ 


0.69 


0.67 


0.67 


0.66 


0.65 


0.19 




a = 


= 0.1 


(0.25) 


(0.28) 


(0.12) 


(0.10) 


(0.10) 


(0.14) 




pbol+ 


0.69 


0.67 


0.67 


0.66 


0.65 


0.18 




a = 


= 0.05 


(0.25) 


(0.28) 


(0.11) 


(0.10) 


(0.10) 


(0.13) 
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Table 4: Results of model M2. The percentage of true model recovered was recorded -'Truth'- 
as well as J = J. \ J\ is the number of fixed effects selected and TP the number of relevant fixed 
effects selected. The signal to noise ratio is equal to SNR = 1.02(0.21). Standard errors are 
given in parentheses, for 100 runs. 



Results 


Truth 


J = J 


\J\ 


TP 








Ideal 


1 


5 


5 


5 


1 


1 


1 


Lasso 


- 


0.11 


5.02 


3.86 


3.62 


- 


- 








(2.69) 


(1.35) 


(0.96) 


- 


- 


adLasso 


- 


0.09 


6.06 


4.24 


3.05 


- 


- 








(2.66) 


(1.16) 


(0.87) 


- 


- 


procbol 


- 


0.24 


3.95 


3.76 


3.62 


- 


- 


a = 0.1 






(1.22) 


(1.06) 


(0.95) 


- 


- 


procbol 


- 


0.21 


3.60 


3.47 


3.53 


- 


- 


a = 0.05 






(1.25) 


(1.14) 


(0.87) 


- 


- 


Lasso+ 


0.17 


0.17 


7.60 


4.92 


1.25 


0.91 


0.93 








(2.64) 


(0.37) 


(0.28) 


(0.40) 


(0.48) 


adLasso+ 


0.08 


0.08 


8.26 


5.00 


0.99 


0.90 


0.85 








(3.15) 


(0.00) 


(0.21) 


(0.38) 


(0.41) 


ImmLasso 


0.17 


0.17 


7.65 


4.93 


1.24 


0.91 


(0.93) 








(2.49) 


(0.36) 


(0.26) 


(0.40) 


(0.48) 


pbol+ 


0.91 


0.91 


4.86 


4.85 


1.01 


0.95 


0.88 


a = 0.1 






(0.59) 


(0.58) 


(0.28) 


(0.38) 


(0.41) 


pbol+ 


0.80 


0.80 


4.57 


4.57 


1.11 


0.93 


0.88 


a = 0.05 






(0.93) 


(0.93) 


(0.39) 


(0.38) 


(0.39) 





k 




Pis 


Pu 


Pi, 


MSE 


Ideal 


0.75 


0.75 


0.75 


0.75 


0.75 


0.00 


Lasso 


0.79 


0.47 


0.21 


0.19 


0.17 


1.19 




(0.27) 


(0.31) 


(0.19) 


(0.17) 


(0.16) 


(0.57) 


adLasso 


0.79 


0.64 


0.36 


0.35 


0.29 


0.84 




(0.27) 


(0.38) 


(0.24) 


(0.24) 


(0.22) 


(0.55) 


procbol 


0.79 


0.72 


0.50 


0.57 


0.52 


0.82 


a = 0.1 


(0.27) 


(0.49) 


(0.40) 


(0.38) 


(0.38) 


(0.55) 


procbol 


0.79 


0.75 


0.44 


0.50 


0.45 


0.93 


a = 0.05 


(0.27) 


(0.50) 


(0.42) 


(0.41) 


(0.40) 


(0.56) 


Lasso+ 


0.82 


0.91 


0.35 


0.35 


0.33 


0.54 




(0.26) 


(0.26) 


(0.13) 


(0.11) 


(0.13) 


(0.24) 


adLasso+ 


0.81 


0.82 


0.51 


0.52 


0.49 


0.33 




(0.25) 


(0.25) 


(0.14) 


(0.13) 


(0.14) 


(0.17) 


ImmLasso 


0.82 


0.91 


0.35 


0.35 


0.33 


0.53 




(0.26) 


(0.26) 


(0.13) 


(0.11) 


(0.13) 


(0.23) 


pbol+ 


0.79 


0.76 


0.70 


0.73 


0.72 


0.23 


a = 0.1 


(0.25) 


(0.26) 


(0.22) 


(0.17) 


(0.18) 


(0.28) 


pbol+ 


0.80 


0.79 


0.64 


0.66 


0.66 


0.35 


a = 0.05 


(0.25) 


(0.28) 


(0.29) 


(0.28) 


(0.28) 


(0.43) 
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Table 5: Results of model M3. The percentage of true model recovered was recorded -'Truth'- 
as well as J = J. \ J\ is the number of fixed effects selected and TP the number of relevant fixed 
effects selected. The signal to noise ratio is equal to SNR = 0.83(0.16). Standard errors are 
given in parentheses, for 100 runs. 



Results 


Truth 


J = J 


\J\ 


TP 








Ideal 


1 


5 


5 


5 


1 


1 


1 


Lasso 


- 


0.22 


4.96 


4.13 


3.32 


- 


- 








(2.18) 


(1.10) 


(0.80) 


- 


- 


adLasso 


- 


0.20 


6.10 


4.58 


2.85 


- 


- 








(2.19) 


(0.70) 


(0.72) 


- 


- 


procbol 


- 


0.28 


4.37 


4.12 


2.90 


- 


- 


ex — u. 1 








(C\ 77\ 

') 








procbol 


- 


0.26 


4.17 


3.97 


2.97 


- 


- 


a = 0.05 






(1.12) 


(0.83) 


(0.82) 


- 


- 


Lasso+ 


0.20 


0.20 


7.07 


4.99 


1.11 


0.91 


0.92 








(2.01) 


(0.10) 


(0.22) 


(0.36) 


(0.46) 


adLasso+ 


0.24 


0.24 


6.70 


4.97 


0.97 


0.88 


0.88 








(1.51) 


(0.17) 


(0.19) 


(0.34) 


(0.45) 


ImmLasso 
















pbol+ 


0.93 


0.93 


5.09 


5.00 


0.95 


0.91 


0.89 


a = 0.1 






(0.38) 


(0.00) 


(0.17) 


(0.33) 


(0.44) 


pbol+ 


0.95 


0.95 


5.08 


5.00 


0.95 


0.91 


0.89 


a = 0.05 






(0.44) 


(0.00) 


(0.17) 


(0.33) 


(0.44) 





Pi 


P2 


Ps 


P4 


P5 


MSE 


Ideal 


0.67 


0.67 


0.67 


0.67 


0.67 


0.00 


Lasso 


0.69 


0.69 


0.18 


0.20 


0.27 


0.90 




(0.25) 


(0.32) 


(0.17) 


(0.17) 


(0.17) 


(0.40) 


adLasso 


0.69 


0.68 


0.32 


0.36 


0.46 


0.60 




(0.25) 


(0.32) 


(0.21) 


(0.21) 


(0.22) 


(0.32) 


procbol 


0.73 


0.65 


0.48 


0.51 


0.57 


0.63 


a = 0.1 


(0.34) 


(0.13) 


(0.36) 


(0.36) 


(0.35) 


(0.42) 


procbol 


0.73 


0.65 


0.44 


0.49 


0.56 


0.68 


a = 0.05 


(0.34) 


(0.13) 


(0.38) 


(0.38) 


(0.36) 


(0.43) 


Lasso+ 


0.71 


0.71 


0.40 


0.38 


0.43 


0.41 




(0.24) 


(0.29) 


(0.12) 


(0.11) 


(0.11) 


(0.19) 


adLasso+ 


0.71 


0.69 


0.50 


0.48 


0.56 


0.30 




(0.24) 


(0.29) 


(0.16) 


(0.14) 


(0.13) 


(0.18) 


ImmLasso 














pbol+ 


0.71 


0.69 


0.67 


0.65 


0.68 


0.19 


a = 0.1 


(0.24) 


(0.29) 


(0.12) 


(0.10) 


(0.10) 


(0.16) 


pbol+ 


0.71 


0.69 


0.67 


0.65 


0.68 


0.19 


a = 0.05 


(0.24) 


(0.29) 


(0.12) 


(0.10) 


(0.10) 


(0.16) 
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Table 6: Results of model M4. The percentage of true model recovered was recorded -'Truth'- 
as well as J = J. \ J\ is the number of fixed effects selected and TP the number of relevant fixed 
effects selected. The signal to noise ratio is equal to SNR = 0.63(0.11). Standard errors are 
given in parentheses, for 100 runs. 



Results 


Truth 


J = J 


\J\ 


TP 








Ideal 


1 


5 


5 


5 


1 


1 


1 


Lasso 


- 


0.00 


2.81 


2.06 


4.08 


- 


- 








(2.80) 


(1.30) 


(0.84) 


- 


- 


adLasso 


- 


0.00 


5.64 


3.03 


3.38 


- 


- 








(4.10) 


(1.22) 


(0.88) 


- 


- 


procbol 


- 


0.15 


3.85 


3.61 


3.23 


- 


- 


a = 0.1 






(1.00) 


(0.95) 


(0.73) 


- 


- 


procbol 


- 


0.15 


3.48 


3.34 


3.39 


- 


- 


a = 0.05 






(1.00) 


(0.99) 


(0.80) 


- 


- 


Lasso+ 


0.25 


0.25 


7.13 


4.99 


1.21 


0.93 


1.03 








(1.84) 


(0.10) 


(0.27) 


(0.41) 


(0.40) 


adLasso+ 


0.01 


0.01 


9.56 


4.87 


0.94 


0.89 


0.98 








(4.01) 


(0.37) 


(0.26) 


(0.37) 


(0.37) 


ImmLasso 


0.25 


0.25 


7.22 


4.99 


1.19 


0.93 


1.03 








(1.95) 


(0.10) 


(0.25) 


(0.40) 


(0.40) 


pbol+ 


0.82 


0.82 


5.21 


4.99 


0.92 


0.97 


1.00 


a = 0.1 






(0.56) 


(0.10) 


(0.17) 


(0.39) 


(0.34) 


pbol+ 


0.88 


0.88 


5.10 


4.98 


0.93 


0.97 


1.00 


a = 0.05 






(0.41) 


(0.14) 


(0.16) 


(0.39) 


(0.34) 





Pi 


P2 


Ps 


P4 


P5 


MSE 


Ideal 


0.67 


0.67 


0.67 


0.67 


0.67 


0.00 


Lasso 


0.60 


0.06 


0.06 


0.06 


0.11 


1.27 




(0.25) 


(0.15) 


(0.11) 


(0.11) 


(0.15) 


(0.32) 


adLasso 


0.60 


0.15 


0.18 


0.17 


0.26 


0.99 




(0.25) 


(0.26) 


(0.19) 


(0.19) 


(0.23) 


(0.33) 


procbol 


0.60 


0.55 


0.38 


0.44 


0.43 


0.83 


a = 0.1 


(0.25) 


(0.31) 


(0.38) 


(0.39) 


(0.40) 


(0.43) 


procbol 


0.60 


0.53 


0.32 


0.38 


0.35 


0.91 


a = 0.05 


(0.25) 


(0.32) 


(0.38) 


(0.40) 


(0.41) 


(0.39) 


Lasso+ 


0.62 


0.55 


0.31 


0.35 


0.37 


0.46 




(0.25) 


(0.27) 


(0.11) 


(0.12) 


(0.12) 


(0.20) 


adLasso+ 


0.61 


0.56 


0.41 


0.43 


0.48 


0.39 




(0.25) 


(0.26) 


(0.16) 


(0.18) 


(0.16) 


(0.19) 


ImmLasso 


0.62 


0.55 


0.31 


0.35 


0.38 


0.45 




(0.25) 


(0.27) 


(0.11) 


(0.12) 


(0.12) 


(0.19) 


pbol+ 


0.60 


0.64 


0.67 


0.67 


0.67 


0.21 


a = 0.1 


(0.25) 


(0.28) 


(0.10) 


(0.11) 


(0.13) 


(0.15) 


pbol+ 


0.60 


0.64 


0.67 


0.67 


0.67 


0.20 


a = 0.05 


(0.25) 


(0.27) 


(0.10) 


(0.13) 


(0.13) 


(0.15) 
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Table 7: Results of model M4 when a ML linear regression is added after the convergence of the 
algorithm. The percentage of true model recovered was recorded -'Truth'- as well as J = J. \J\ 
is the number of fixed effects selected and TP the number of relevant fixed effects selected. The 
signal to noise ratio is equal to SNR = 0.63(0.11). Standard errors are given in parentheses, for 
100 runs. 





Ideal 


ImmLasso 


Lasso-|- 


Truth 

J-l LIU 11 


1 




25 


f — T 


1 

J. 


n 9^ 


95 


1 /I 




7 22f1 QB"! 

1 .£j£j 1 -1 .iJXj i 


7 13n 84) 


TP 


5 


4.99(0.10) 


4.99(0.10) 




1 


1.19(0.25) 


1.21(0.27) 




1 


0.96(0.39) 


0.96(0.40) 


al 


1 


1.01(0.36) 


1.01(0.36) 




0.67 


0.61(0.25) 


0.61(0.25) 




0.67 


0.62(0.28) 


0.62(0.28) 




0.67 


0.61(0.12) 


0.61(0.12) 




0.67 


0.63(0.12) 


0.63(0.12) 


h 


0.67 


0.62(0.14) 


0.62(0.14) 


mse 





0.40(0.17) 


0.40(0.17) 



23 



Appendix B - Remark on the tuning parameter 



The tuning of the regularization parameter could be tricky for some methods, especially 
the Lasso method and the adLasso method. In this section, we look at the causes. 

We shall begin to consider the classical linear model before studying the linear mixed 
model. Let us first look at the Lasso method when only applied in a classical linear model. 



We compare two penalizations of the likelihood: BIG and the Extended BIG (EBIG) ( Ghen 



and Ghen 2008). The EBIG penalizes a space of dimension k with a term that depends on 



the number of spaces that have the same dimension, which is -^^^^r^\ ] thus EBIG penalizes 
more the complex spaces than BIG. Figure [3] shows the behavior of the BIG and EBIG 
criteria, the log-likelihood and the residual variance for several values of the regularization 
parameter of the Lasso in a low dimensional case {p = 80). We observe that tuning the 
regularization parameter in this case raises no problem. 



OX) 



— \ 1 \ \ — 

0.2 0.4 0.6 08 
Regularization parameter 



o BIC 
+ EBIC 



(a) BIC or EBIC depending on the value of 
the regularization parameter of the Lasso 
method 




(b) —2x log-Likelihood depending on the (c) Residual variance depending on the reg- 
regularization parameter of the Lasso ularization parameter of the Lasso method 
method 



Figure 3: One simulation of linear model for the Lasso method with n = 120, p = 
80 and /3j = 1. 



24 



Let us now consider a simulation in a high dimensional context in which we have n = 120 
observations and p = 600 explanatory variables. Results of the two methods for choosing 
the regularization parameter of Lasso are presented in Figure |4j 




(a) BIC or EBIC depending on the value of 
the regularization parameter of the Lasso 




(b) — 2xlog-Likelihood depending on the (c) Residual variance depending on the reg- 
regularization parameter of the Lasso ularization parameter of the Lasso method 
method 



Figure 4: One simulation of linear model for the Lasso method with n = 120, p = 
600 and /3j = 1. 



Firstly, we confirm that EBIC is more conservative than BIC and penalizes more the 

we observe that both the BIC and the EBIC 



complex spaces. On the far left of Figure 4(a^ 



curves decrease when the regularization parameter is close to zero. This phenomenon is 
due to the degeneracy of the likelihood that can be seen in Figure 4(b)| (s tated in Section 
^ for mixed models, it can also happen in linear models). Figure 4(c)| shows that the 
degeneracy of the likelihood comes from the residual variance that drops to zero when 
the regularization parameter is close to zero, and thus when too much variables enter the 
model. 

To conclude, we see that both BIC and EBIC penalties are not sufficiently strong to 
completely balance the degeneracy of the likelihood; however, EBIC penalty leads to select 
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a more parsimonious model while BIG penalty selects a more complex model. Nonetheless, 
the EBIC penalty is usually too much conservative in practice, that is why the simulation 
study used the BIG penalty. When the degeneracy happens -as it is likely to occur as 
p grows-, the regularization parameter should be optimized over an area that does not 
contain the explosion of the likelihood, that means that the area should not contain the 
far left part of Figure |4(a)| where the criterion decreases. 



We now look at the Lasso+ method. As mentioned in the paper, the maximal number 
of fixed-effects that can be selected with the Lasso + method is small compared to n or 
p. Thus, the degeneracy of the likelihood never occurred in our simulations (Figure [s]). 
However, if this phenomenon happens, the choice of the grid of the regularization parameter 
should follow the same advice as the one given above for the classical linear model. 




0.21 0.22 0.23 

Regularization parameter 



(a) BIG or EBIC depending on the value of 
the regularization parameter of the Lasso+ 
method 




0.20 0.21 0.22 0.23 0.24 

Regularization parameter 

(b) — 2xlog-Likelihood depending on the 
regularization parameter of the Lasso+ 
method 




~r 



0.20 0.21 0.22 0.23 

Regularization parameter 




\ T 

0.20 0.21 0.22 0.23 

Reguianzation parameter 



(c) Residual variance depending on the (d) Residual variance depending on the 
regularization parameter of the Lasso+ regularization parameter of the Lasso+ 
method method 

Figure 5: One simulation of linear mixed model with n = 120, p = 600, /3j 
and two i.i.d. random effects. 
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Appendix C - Proof of Proposition 2.2 



G and R are supposed to be known. Thus the minimization of our objective function g 

reduces to the minimization of the following function in {(3,u): 

h[u,P) = {y - X 13 - Zu)' R-\y - X 13 - Zu) + u'G-^u + X\l3\i. 

Let denote {u,/3) = argmin h{u,P). Since the function h is convex, we have: 

{«,/9) 

u{f3) = argmin h{u, (3) 

u 

= I /3 = argmin . 

u = u{(3) 

Since — ^ ' exists, we can explicit the minimum of h in u: 
ou 

( u{[3) = {Z'R~^Z + G'^y^Z'R-^{y - 
^) ^ J /3 = argmin h{u{f3), (3) 

[ u = u0) 
Thus, we obtain: 



h{u{l3),(3) = {y-XP-Zu{(3))'R-\y-X(3-Zu{P))+u'G-\ + X\P\i 

= {y- Xf3)'R~\y - Xf3) - {y - X f3)R-^ Zu{f3) - {Zu{f3))' R-\y - Xf3) 

+{Zu)'R-'^Zu{(3) + u{(3yG-^u{(3) + A|/3|i 
= {y- X(3)' [R-^ - R-^Z{Z'R-^Z + G'^y^Z'R-^] {y - X/3) + A|/3|i 

Denote W = R-^ ~ R-^Z{Z'R-^Z + G'^y^Z'R-^ We can show that W = {Z'GZ + 
R"^)^^ = V^^. This result comes from the equivalence between the resolution of Hender- 
son's equations (Henderson, 1973) and the generalized least squares. 
To conclude, we have that 

{u, (3) = (^Z'R-^Z + G-^y^Z'R-\y - argmin {y - XI3)'V~^{y - Xf3) + \\f3\i^ . 
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